De‐novo assembly of four rail (Aves: Rallidae) genomes: A resource for comparative genomics

Abstract Rails are a phenotypically diverse family of birds that includes 130 species and displays a wide distribution around the world. Here we present annotated genome assemblies for four rails from Aotearoa New Zealand: two native volant species, pūkeko Porphyrio melanotus and mioweka Gallirallus philippensis, and two endemic flightless species takahē Porphyrio hochstetteri and weka Gallirallus australis. Using the sequence read data, heterozygosity was found to be lowest in the endemic flightless species and this probably reflects their relatively small populations. The quality checks and comparison with other rallid genomes showed that the new assemblies were of good quality. This study significantly increases the number of available rallid genomes and will enable future genomic studies on the evolution of this family.


| INTRODUC TI ON
Rails (Aves: Rallidae) are a phenotypically diverse family of primarily terrestrial birds with relatively short wings and strong, variably elongated bills (Livezey, 2003;Ripley et al., 1977;Taylor, 1998).Despite the terrestrial lifestyle of the majority of the species (Taylor, 1998), this bird family displays remarkable dispersal capacity resulting in broad distribution and the colonisation of numerous oceanic islands (Garcia-R et al., 2017;Olson, 1973;Ripley et al., 1977).At the same time, more than 30 flightless rail species are known (Kirchman, 2012;Steadman, 1995) and a large proportion of them are endemic to single oceanic islands, demonstrating that their ancestors had been volant (Trewick, 1997a(Trewick, , 1997b)).The high proportion of flightless species as well as the fact that flightlessness evolved many times among extant rails provides a suitable system with which to study genomic changes associated with maintenance and loss of flight in birds.
Moreover, it has been shown that these differences are independent of phylogeny and instead demonstrate convergent evolution associated with a walking ecology (Gaspar et al., 2020).Despite some research using short markers at the population level (Garcia-R et al., 2017;Garcia-R & Trewick, 2014;Trewick et al., 2017), the molecular basis underlying the convergent evolution of flightless rails remains unknown.
To investigate that question, more genomic data are needed.Here we present new, annotated rail genome assemblies of four rail species from Aotearoa New Zealand; two volant, purple swamphen (called pūkeko in Aotearoa New Zealand) Porphyrio melanotus melanotus (Temminck, 1820), and buff-banded rail (also called mioweka and moho pererū) Gallirallus philippensis assimilis (Linnaeus, 1766), and two flightless species, takahē Porphyrio hochstetteri (Meyer, 1883), and weka Gallirallus australis australis (Sparrman, 1786) (Clements et al., 2023).These four genome assemblies were generated to provide two volant-flightless pairs of closely related living species that will enable future genomic comparisons to highlight the differences and similarities in evolutionary trends between rails with and without the ability to fly (Figure 1).

| DNA extraction and sequencing
DNA was extracted from muscle tissue samples of four rails sampled in Aotearoa New Zealand: Porphyrio melanotus, Gallirallus philippensis, Porphyrio hochstetteri, and Gallirallus australis.Permission to obtain roadkill specimens is given by Department of Conservation Authority WA-17590-DOA.Full sample details can be found in Table 1.Extraction used the Geneaid© Genomic DNA Mini Kit following the kit instructions and eluted in 100 μL.DNA quality was then verified by gel electrophoresis and quantified using Qubit 2.0.Library preparation using the TruSeq Nano DNA kit and quality check were performed by the Massey University Genome Service (New Zealand) with sequencing by Novagene (Hong Kong).Libraries were sequenced on the Illumina HiSeq™ X platform generating non-overlapping 150 bp paired-end reads with an insert size of 550 bp.Fastp V0.19.4 (Chen et al., 2018) was used with default settings for paired-end data to trim the adapters as well as filter and assess the read quality.

| Genome assembly
De novo assembly was performed for each of the genomes using Meraculous (Chapman et al., 2011).Average insert size, standard deviation, and average read lengths were estimated using sequence reads mapped to a nuclear gene of a close species.Following the Meraculous manual instructions, a range of k-mer sizes were analysed using KmerGenie V1.7051 (Chikhi & Medvedev, 2014).The k-mer frequency histograms were reviewed and we selected k that had a main haploid peak with at least 30× coverage and a distinct trough to its left that was at most 1/10 of the peak height.
These were 61, 87, 61, and 57 for respectively Porphyrio melanotus, Porphyrio hochstetteri, Gallirallus philippensis, and Gallirallus australis (Figure 2).High heterozygosity for G. philippensis meant that ideal peak height/trough specs could not be met but the assembly was still successful.See supplementary data for full details of settings used in all Meraculous runs.F I G U R E 1 Maximum likelihood (RAxML V.8) phylogeny of three volant (black) and five flightless (red) rail lineages (Aves: Rallidae) based on 10 concatenated nuclear genes analysed with Balearica regulorum crane (Aves: Gruidae) (grey) as outgroup; bootstrap supports are indicated for each node.

| Additional genomes
In order to assess the quality of our genome assemblies, we compared them to a selection of additional rail genomes, Okinawa rail Gallirallus okinawae (also known as Hypotaenidia okinawae), GenBank

| Quality assessment
Meraculous outputs were used to compare the sequence length of the shortest scaffold at 50% of the total genome length (N50) and the smallest number of scaffolds whose total length makes up half of the genome size (L50) values as well as the assembly length and the number of contigs and scaffolds.Busco v4 (Seppey et al., 2019) was implemented using a Docker (Merkel, 2014) container (default parameters, mode: genome) on the genomes using the aves_odb10 dataset to assess the assembly completeness.

| Genome annotation
Geneious R.11 (https:// www.genei ous.com) was used to extract the coding sequences (CDS) from B. regulorum genome (GCA_000709895) and these were filtered to retain only the longest CDS per gene where multiple annotations existed.Gmap (version 2019-09-12) (Wu & Watanabe, 2005) was used to annotate the newly assembled genomes.Each assembly was first indexed using the gmap_buil function, and then B. regulorum CDS were mapped to it with the setting -f 2 to obtain a GFF3 formatted annotation.

| Extracting coding regions
During the assembly process, exons from the same gene are sometimes assembled into different scaffolds.To obtain a sequence list containing the entire coding region for each gene, the exons were extracted using Geneious R.11 and remapped to the B. regulorum CDS with BWA (0.7.17-r1188) using BWA-mem with the default settings (Li, 2013).
To assess the size and quality of the extracted CDS for each genome they were compared to the B. regulorum reference.The quality (complete or partial) of coding regions retrieved was assessed using the samtools V.1.9(Li et al., 2009) faidx tool (to obtain the length of each sequence) and a custom R script to compare the CDS sequences with the reference (see supplementary data).

| Phylogeny
A basic phylogenetic inference was performed to show relative relationships between the four new genomes and other selected rails with Balearica regulorm as outgroup.Ten genes selected from a set of universal nuclear markers suitable for avian phylogenetic reconstruction (Liu et al., 2018) were used to construct the phylogenetic tree.The genes were ADNP, BEGAIN, INO80D, KBTBD8, NCOA6, RHOBTB1, S1PR3, SPECC1L, ZNF618, and ZNF654.These 10 CDS alignments were concatenated into a 21,390 bp alignment using Phyluce v1.7.1 (Faircloth, 2016) with the default settings and the best-fit partitioning scheme was determined using PartitionFinder2 (Lanfear et al., 2017) via the CIPRES Science Gateway (Miller et al., 2010).A list of genes and partitions can be found in the supplementary data.Maximum Likelihood (ML) analyses were implemented in RaxML v8.2.10 (Stamatakis, 2014) via the CIPRES Science Gateway with bootstrapping automatically stopped employing the majority rule criterion.The consensus tree was then visualised in Geneious (Figure 1).

| DNA extraction and sequencing
The raw data comprised between 780 million (G.philippensis) and 936 million (G.australis) paired reads per species.Most of these were retained after the filtering and cleaning step (Table 2).Fastp generates a Phred quality score (Q score) for each of the species that represents the ratio of bases with a probability of containing no more than 1/100 (Q20) or in 1/1000 (Q30) errors (Ewing et al., 1998;Ewing & Green, 1998;Richterich, 1998).These scores range between 97.37% F I G U R E 2 K-mer frequency in four rails from Aotearoa New Zealand.K-mer (nucleotide sequence of a certain length) were 57, 61, 61, 87 for Gallirallus australis, Gallirallus philippensis, Porphyrio melanotus and Porphyrio hochstetteri respectively.In each distribution, two main peaks correspond to the genomic K-mers for the heterozygous (left) and homozygous (right) parts of the genome.The single main peak of P. hochstetteri indicates high homozygosity.Low depth peaks corresponding to erroneous K-mer populations have been masked for clarity.
Icons indicate flightless and volant species.
and 98.5% for Q20 and between 93.74% and 95.23% for Q30 implying high sequencing quality for all four species.
K-mer frequency plots can be used to estimate the level of heterozygosity for each individual and by proxy each species.Indeed, k-mers from the heterozygous regions (left peak on Figure 2) will have half the sequencing coverage (i.e., K-mer frequency) compared to the homozygous regions (right peak).The higher the left peak the higher the heterozygosity.The two volant species G. philippensis and P. melanotus exhibited high heterozygosity with the left peak being higher than the right for G. philippensis.A very low left peak was found for the G. australis data and only one peak was observed for P. hochstetteri.This implies a much lower level of heterozygosity for both of the endemic, flightless species that have limited populations.
In addition to the K-mer frequencies, the relative heterozygosities among the newly assembled genomes were compared by mapping the paired reads to a set of 20 genes for each species.The ratio of heterozygous sites divided by the total sequence length was calculated (Figure 3).The two volant species showed a higher heterozygosity level than the two flightless species.Based on the paired reads mapping, the mean depth of coverage was calculated for each species (Table 3) with the overall average being 96.4x.
This was similar to the previously assembled rails (between 1.11 and 1.27 Gb, see Table 2) and slightly shorter than the crane B. regulorum (1.22 Gb).
BUSCO scores were similar for P. melanotus, P. hochstetteri, and G. australis with close to 80% of single copy genes which were found complete.In contrast, G. philippensis comprised 69% of "Complete single copy" and had a higher proportion (17%) of missing genes (Table 3).A chromosome-level assembly of P. hochstetteri became available while this work was in progress.Unsurprisingly, the comparison between both assemblies shows that the long-read sequencing technology yields a higher BUSCO score than the short-read one.
Nevertheless, the new assemblies have comparable BUSCO scores to other short-read rail genome assemblies.

| Extracting coding regions
For the four new rail assemblies, the coding regions of each gene were extracted based on the annotations and compared with the respective B. regulorum CDS.Over 9000 gene CDSs were retrieved near-complete (above 95% of the reference CDS nucleotide sequence length) for the two Porphyrio species and Gallirallus australis (Figure 4).G. philippensis exhibited a slightly lower proportion (8259 CDS over 95%) which was consistent with the BUSCO results.The CDSs present in the reference genome but not in the rail data ("Not found" in Figure 4) represent less than 7.5% of the CDS for all species.

| Phylogeny
A maximum likelihood phylogenetic inference was made based on 10 genes selected from a set of universal avian markers (Liu et al., 2018) (Figure 1).The generated tree is consistent with the previously pub-

| DISCUSS ION
We present a straightforward method, based on limited sequencing resources, to generate genomic data for comparative analyses.
Although this method does not result in chromosome level assemblies, it does not require a reference genome, is easily reproducible, and retrieves a significant proportion of the coding regions.
Considerable variation was observed between species heterozygosity (Figures 2 and 3).Indeed, the two volant species were more heterozygous than the flightless ones (Figure 3) with big differences being observed between the most heterozygous species, G. philippensis (frequency of heterozygous site of 0.01) and the least heterozygous species P. hochstetteri (0.0002).Those observations were consistent with the K-mer frequencies (Figure 2).The low heterozygosity in flightless species probably reflects their much reduced populations owing to habitat loss and invasive predators (Baker et al., 1995;Burga et al., 2017;White et al., 2018).The takahē P. hoschetteri is a critically endangered flightless species with a population of only 500 in 2023 (www.doc.govt.nz), all derived from a remnant discovered in the 1950s that may have numbered as low as two individuals (Wallace, 2002).The resulting inbreeding depression likely explains its extremely low level of heterozygosity (Grueber et al., 2010).Gallirallus philippensis on the other hand is a relatively abundant species with a geographic range that includes the islands of Aotearoa New Zealand and the western Pacific (Garcia-R et al., 2017;Trewick, 1997b) which is likely to maintain high heterozygosity at the species level.
The four newly assembled genomes have similar or better characteristics than the other rail genomes assembled from Illumina HiSeq data (Table 3) with N50 and L50 scaffolds within the same range as these other rails.The BUSCO results (Table 3) and CDS extractions (Figure 4) showed similar trends and add to our confidence that the genome assemblies are of good quality with limited assembly errors.Despite being naturally more fragmented than those assembled using long-read sequencing technology (Table 3), a significant proportion of full-length coding regions were identified and extracted showing good utility for future comparative analyses (Figure 4).
Among the four newly assembled rallid genomes, G. philippensis had the lowest proportion of complete genes according to both the BUSCO (Table 3) and extracted CDS comparison (Figure 4).This can be attributed to the high heterozygosity level which generally makes the assembly process more challenging due to the increased complexity of the de Bruijn graph structure (Kajitani et al., 2014).
Nonetheless, the G. philippensis genome is a good quality assembly that can be used to investigate evolutionary processes along with the three other assembled genomes.For all four genomes, a large majority of the genes were retrieved.In all species, over 70% of the genes were identified with greater than 75% completeness.
To conclude, we provide here four new avian assemblies which assembly accession: GCA_027925045.1, Henderson crake Zapornia atra (formerly Porzana atra) GCA_013400835.1, Eurasian coot Fulica atra GCA_013372525.1, Inaccessible Island rail Atlantisia rogersi GCA_013401215.1, and takahē Porphyrio hochstetteri GCA_020800305.1.The chromosome-level takahē genome assembly was released while this work was in preparation and is included in the comparative analysis.The genome of a grey crowned crane Balearica regulorum (order Gruiformes, family Gruidae; Bennett, 1834) GCA_011004875.1 was used as a reference for the gene annotations.
represent valuable genomic resources to investigate evolutionary processes within the rail family.The quality checks that were performed showed that the generated assemblies are reliable.Comparing the newly assembled genomes showed lower levels of heterozygosity in flightless species which likely reflects their relatively small populations.This study significantly increases the number of available rallid genomes, targeting flying-flightless pairs; this creates new opportunities to investigate the evolution of avian flightlessness.AUTH O R CO NTR I B UTIO N S Julien Gaspar: Conceptualization (equal); data curation (lead); formal analysis (lead); investigation (lead); methodology (lead); software (equal); visualization (lead); writing -original draft (lead); writingreview and editing (equal).Steve A. Trewick: Conceptualization (supporting); data curation (supporting); formal analysis (supporting); funding acquisition (supporting); methodology (supporting); supervision (supporting); writing -original draft (supporting); writing -review and editing (equal).Gillian C. Gibb: Conceptualization (equal); data curation (supporting); formal analysis (supporting); funding acquisition (lead); investigation (supporting); methodology (equal); project administration (lead); software (equal); supervision (lead); writing -original draft (supporting); writing -review and editing (equal).ACK N OWLED G EM ENTSThis study was supported by the New Zealand Marsden Fund Council from Government funding, managed by the Royal Society Te Apārangi, grant MAU1601 to GCG.Thanks to Roger Moraga for initial bioinformatic discussions and assistance using the software Meraculous.The genome assemblies were generated with the help of Genomics for Aotearoa New Zealand (GFANZ, genomics.nz),thanks F I G U R E 4 Completeness of CDSs retrieved from eight rail genomes compared to the reference crane Balearica regulorum genome that has a total of 15,035 annotated CDSs.Colours indicate the proportion of genes retrieved from a sample at various scales of completeness.
Fastp outputs after the sequencing of four rail species indicating the number of reads before and after filtering as well as the quality assessment.Average heterozygosity at 20 randomly selected genes from four newly assembled and annotated rail genomes (average total length 266,456 bp).Heterozygosity is the proportion of total nucleotide sites per individual site having two bases.Icons indicate flightless and volant species.De novo genome assembly metrics among 8 rail species and one crane (Balearica regulorum).
lished rallid phylogenies (Garcia-R et al., 2014a, 2014b; Kirchman TA B L E 2 F I G U R E 3 TA B L E 3